Abstract

Background: Point mutations can alter the stability of a protein. Among sequence-based machine learning approaches to
protein stability prediction, the current encoding schemes are sparse encoding and amino acid property encoding.
Property encoding schemes exploit physical-chemical information about the environment of the mutated residue, but they
become complex when many properties are combined in one scheme. This complexity introduces noise that lowers the
accuracy of machine learning algorithms. To overcome the problem, we describe a new encoding scheme that grades the
twenty amino acids into groups according to their values for each property.

Results: We used three predefined values, 0.1, 0.5, and 0.9, to represent the 'weak', 'middle', and 'strong' groups
of each amino acid property, and introduced two thresholds per property to assign each of the twenty amino acids to
one of the three groups according to its property value. Each amino acid therefore takes one of three predefined
values, rather than one of twenty different values, for each property, which reduces the complexity and noise of the
encoding. In 20-fold cross validation, the graded amino acid property encoding scheme improved average accuracy by
more than 7%. The overall accuracy of our method is more than 72% on independent test sets, starting from sequence
information and using the three-state prediction definition.

Conclusions: Grading the numeric values of amino acid properties reduces the noise and complexity of the input
information. It is in accordance with biochemical concepts of amino acid properties and simplifies the input data
at the same time. The idea of graded property encoding schemes may be applied to other protein-related
predictions with machine learning approaches.



Background 

Protein thermodynamic stability change upon single point mutation is a crucial problem in protein
engineering and molecular biology research. Many methods have been developed over the last decades to
predict the change in protein stability free energy (ΔΔG). While energy function-based approaches and
statistical analyses have been used to compute the free energy change [1-14], machine learning approaches
have attracted more attention as the amount of experimental thermodynamic data available in the ProTherm
database has grown [15-21]. When the tertiary structure is available, structure-based machine learning
approaches generally perform better than sequence-based ones [19]. The number of known protein structures,
however, is less than one percent (0.45%) of the number of known protein sequences: UniProtKB/TrEMBL
Release 2011_08 of 27-Jul-2011 contained 16,504,022 protein sequence entries, while the PDB held only
75,105 structures as of 5 p.m., Tuesday Aug 09, 2011. Most of the available information about proteins is
therefore still limited to sequence, and sequence-based protein stability prediction methods have
attracted growing research interest [1-7,15-19].






© 2012 Liu and Kang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.



Liu and Kang BMC Bioinformatics 2012, 13:44
http://www.biomedcentral.com/1471-2105/13/44






Sequence-based protein stability prediction methods usually capture the environment of the mutation
site with a sliding window of fixed length centred on the target residue. The encoding schemes used
with the sliding window strategy fall into two categories. The first is the sparse encoding scheme,
which represents each amino acid with twenty distinct input units [16-19]. The second is the amino
acid property encoding scheme, which integrates the physical-chemical properties of amino acids into
the machine learning input [15]. Rather than representing amino acids with 20 characters, property
encoding schemes use amino acid physical-chemical properties and usually perform better. Each
property is represented by 20 different numbers, one per amino acid. With 15 properties, each window
position can therefore take 300 (20 × 15) different values, and with a sliding window of length 31
there are 9,300 (300 × 31) possible values per input vector. So much information can act as noise
for the machine learning algorithm.

One way to improve the classification is to put more information into the input encoding while, at
the same time, refining the quality of the discriminating features. Although each amino acid
property takes many different numeric values, from a physical-chemical point of view the amino acids
can be partitioned into three groups for each property: strong, middle, or weak. For example, the
hydrophobicity of an amino acid can be strong, middle, or weak. If the number of values for each
property is reduced to three, the input to the algorithm becomes much simpler.

Here we developed a property grading method to differentiate the amino acids and reduce the noise in
the amino acid properties. We found that the property grading method performed better both in the
traditional cross-validation test and on current independent test sets.

Results and discussions 

Three-state prediction definitions 

Both 'two-state' and 'three-state' predictions are used in the protein stability prediction field.
In two-state predictions, the result is reported as a stability "increase" or "decrease"; in
three-state predictions, it is reported as "increase", "neutral", or "decrease". Although accuracy
scores are usually higher for two-state predictions, three-state predictions are more reasonable
from a molecular biology point of view. We adopted Capriotti's 'three-state prediction' definition
[19] for all of our experiments.

Cross validation results with different encoding schemes 

Cross validation on a single dataset is considered the strictest way to evaluate different encoding
schemes. To prevent similar sequences from appearing in both the training set and the test set, the
dataset sequences were searched with BLAST against a database built from the same sequences and
grouped by similarity: sequences with similarity > 25% in the BLAST results were clustered into one
group. Groups were randomly selected to form a test set, and the corresponding training set
consisted of the dataset sequences that were not in the test set. After applying the different
encoding schemes and the training-test procedure, the prediction accuracies of twenty rounds of
cross validation were averaged for each encoding scheme.

It is generally held that amino acid physical-chemical property encoding is better than sparse
encoding (an arbitrary numeric representation of amino acids) because amino acid properties carry
intrinsic physical meaning. However, property encoding schemes can have two problems. The first
concerns the choice of properties. When only one property was used, such as hydrophobicity (K-D in
Table 1), the prediction could not reach high accuracy. The protein secondary structure propensity
factors carry information from protein structure and were expected to be helpful as input; however,
used alone they did not perform well either (HEC in Table 1). The physical-chemical 11-factor
encoding gave almost the same results as sparse encoding. The sparse encoding scheme (sparse in
Table 1) was used as the control in our experiment. When physical-chemical properties and structural
propensities were combined, better performance was achieved; AAproperty15 is a good example of such
a combination. The overall accuracy (Q3) of the amino acid property encoding scheme (AAproperty15 in
Table 1) was 3% higher than that of the sparse encoding scheme. On the other hand, more property
factors are not always better. We also tried as many as 48 factors from AAindex [22] in the encoding
scheme, and the results showed no improvement in prediction accuracy (data not shown).

The second problem of property encodings is the noise and complexity of the input factors. Grading
the numeric property values reduces the noise from the input factors and improves performance. When
the properties were graded into three classes represented by three distinct numbers
(AAproperty15Grade in Table 1), the predictions improved: Q3 of AAproperty15Grade was 4% higher than
that of the non-graded scheme (AAproperty15 in Table 1). Overall, the graded property encoding
scheme was 7% more accurate than the sparse encoding scheme. The Matthews correlation coefficient
(MCC) improved as well. With the graded property encoding scheme, the sequence-based method is even
competitive with the structure-based approach (Q3 61% and MCC 0.35 [19]) in three-state mutation
stability predictions.

Table 1 Cross-validation performance of the sequence-based SVM method with different encoding schemes

Encoding scheme    Q3     MCC   Q(N)   Q(+)   Q(-)   Spec(N)  Spec(+)  Spec(-)  PPV(N)  PPV(+)  PPV(-)  NPV(N)  NPV(+)  NPV(-)  MCC(N)  MCC(+)  MCC(-)
...                56.81  ...  81.10  65.35  ...
...                56.92  0.28  59.07  52.32  51.27  ...  80.77  ...  65.46  60.17  63.32  72.32  ...  0.32  0.33
HEC                56.91  0.29  58.32  50.56  52.47  66.55    81.39    80.34    65.28   65.43   63.45   65.79   74.56   71.57   0.21    0.32    0.32
K-D                55.98  0.25  57.81  51.64  49.73  63.72    78.29    ...      63.54   62.14   63.30   66.57   73.22   ...     0.20    0.34    0.31
AAproperty15       59.57  0.31  61.72  56.13  57.40  60.96    79.57    81.48    68.16   65.89   67.87   65.02   76.83   76.71   0.30    0.35    0.34
AAproperty15Grade  63.63  0.36  64.15  58.23  57.62  61.95    80.35    82.07    69.81   62.52   69.12   67.18   78.31   78.96   0.34    0.39    0.36

All numbers except MCC are percentages. +, - and N: the indices are evaluated for increasing, decreasing, and neutral protein free energy stability change, respectively, according to the classification
described in section 2 of Results and discussions; for the definitions of the indices see Scoring the performance in Methods. ‖ Data from Capriotti [19].

Test results on independent test datasets 

The dataset DBSEQ_Sep05 was used to build our prediction model. To evaluate its performance, the
chosen independent test sets were searched with BLAST against the DBSEQ_Sep05 sequence database, and
mutation samples whose sequences shared more than 25% similarity with sequences in the DBSEQ_Sep05
dataset (Additional file 1) were deleted from the test sets. For example, 1132 sequence-similar
mutations were deleted from the Potapov data set (2153 mutation samples in 79 proteins), and the
resulting independent test set clean.Potapov (Additional file 2) retained only 1021 mutations in 50
protein chains. The statistics and an explanation of the clean independent test sets are given in
Additional file 3: Tables S1 and S2.

Table 2 shows the prediction accuracies obtained when the clean independent test sets were predicted
with the graded property encoding DBSEQ_Sep05 model. The average Q3 of 72.55% demonstrates the
advantage of the graded property encoding scheme; it is the highest accuracy we could find in the
literature for three-state predictions.

ROC comparisons 

Comparing the sparse and amino acid property encoding schemes, a slight improvement is detected for
the amino acid property encoding scheme. This can be seen in the ROC curves for both the
stabilizing/destabilizing and the neutral mutations (Figure 1). Comparing graded amino acid property
encoding with amino acid property encoding, the AUC of the graded scheme is evidently larger than
that of the non-graded scheme for the stabilizing/destabilizing mutations (Figure 1A).

ROC curves for the three encoding schemes. The cross-validation true positive rate (TPR) is plotted
against the false positive rate (FPR) for the sparse, the property, and the graded property encoding
schemes. In part (A), the ROC curves of the three encoding schemes refer to the stabilizing and
destabilizing mutations, while in part (B) they refer to neutral mutations. The solid lines are the
average values over the independent tests of each scheme, and the dashed lines are the individual
test instances, shown to indicate the distribution of the test values. The vertical bars represent
standard errors.

Conclusions 

Physical-chemical properties of amino acids carry intrinsic physical meaning and underlie the common
characteristics that proteins present in living systems. Numerical representations of these
properties come from real-world experiments and reflect the balance of multiple physical forces.
Amino acid physical-chemical property encodings, if used well in protein-related predictions, should
therefore be better approaches than artificial encodings such as sparse encoding, an arbitrary
numeric representation of amino acids.

The graded physical-chemical property approach divides amino acids into strong, middle, or weak
groups according to their specific property values. It is in accordance with biochemical concepts of
amino acid properties and simplifies the data at the same time. The idea of grading properties may
be applied to other protein-related predictions with machine learning approaches.

Table 2 Performance on independent datasets

...      76.28  ...  70.57  ...  64.47  ...  0.48  0.58  ...
Average  72.55  0.53  79.61  66.11  65.06  69.57  ...  0.49  0.60  0.54

For notation see Table 1. For details and statistics of the independent test sets see Tables S1 and S2.

Figure 1 ROC curves for different encoding schemes of the sequence-based predictor.

Methods

Data descriptions

Experimental data in the ProTherm database [21] are affected by errors. When the value of the free
energy change is close to 0 and the associated error is taken into account, the sign of ΔΔG for a
single measurement can change from decreasing to increasing and vice versa. The ΔΔG threshold for
mutation classification is therefore set by the standard errors reported in experimental works. In
accordance with Capriotti's criteria [18], we used |1.0| kcal/mole as the classification threshold.
According to its experimental ΔΔG value, each mutation is assigned to one of the following three
classes:

i) destabilizing mutation, when ΔΔG < -1.0 kcal/mole;

ii) stabilizing mutation, when ΔΔG > 1.0 kcal/mole;

iii) neutral mutation, when -1.0 ≤ ΔΔG ≤ 1.0 kcal/mole.
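
The three-class rule can be written as a small function. The following is a minimal sketch rather than the authors' code: the function name is hypothetical, and treating values exactly at ±1.0 kcal/mole as neutral is an assumption. The numeric labels follow the convention given in the section on the predictors.

```python
# Minimal sketch of the three-state rule; names are illustrative only.
# Labels follow the convention stated in the text:
# "0" stabilizing, "1" destabilizing, "2" neutral.
LABELS = {"stabilizing": 0, "destabilizing": 1, "neutral": 2}


def classify_ddg(ddg, threshold=1.0):
    """Return the class name for an experimental ddG value in kcal/mole."""
    if ddg < -threshold:
        return "destabilizing"
    if ddg > threshold:
        return "stabilizing"
    # Assumption: boundary values (exactly +/- threshold) count as neutral.
    return "neutral"


def label_ddg(ddg, threshold=1.0):
    """Return the numeric SVM label for an experimental ddG value."""
    return LABELS[classify_ddg(ddg, threshold)]
```

For instance, `label_ddg(-2.3)` gives 1 (destabilizing) and `label_ddg(0.4)` gives 2 (neutral).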

The data set compiled by Capriotti [19], named DBSEQ_Sep05, was used to develop our models. The
S1615 and S388 data sets [16], the PoPMuSiC data set [9], the Potapov-DB data set [8], and
TEST_May11 were chosen for independent performance comparisons.

The DBSEQ_Sep05 data set contains 1623 different single point mutations and related experimental
data for 58 different proteins. Among these mutations, there are 138 stabilizing, 663 destabilizing,
and 822 neutral mutations. The three classes are quite unbalanced, which would bias model training
in machine learning procedures. From the point of view of basic thermodynamics, a protein and its
mutated form should be endowed with the same free energy change, irrespective of the reference
protein (native or mutated). Hence, we can assume that the magnitude of the free energy change is
the same in going from one molecule to the other and that only the sign of ΔΔG changes. We overcame
the asymmetric abundance of the three classes by reversing the sign of ΔΔG for each stabilizing and
destabilizing mutation, which doubled the stabilizing/destabilizing samples (138 + 663 = 801 in each
class) and gave 801 stabilizing, 801 destabilizing, and 822 neutral mutation samples.

The S1615 data set was compiled from an earlier version of the ProTherm release and thus includes
fewer data than DBSEQ_Sep05. The S388 data set is a subset of S1615 that contains only data obtained
under physiological conditions, at temperatures from 20.8°C to 40.8°C and pH values from 6 to 8.

The PoPMuSiC data set was compiled for PoPMuSiC-2.0 [9] and contains 2648 different point mutations
in 131 proteins. Only mutations in globular proteins were considered. The PoPMuSiC data set is
considered non-redundant in itself: when several ΔΔG values were available for a mutant, a single
value was defined as a weighted average of all of them, with weights favouring normal experimental
conditions of temperature and pH.

The Potapov-DB data set [8] contains 2155 mutations in 79 proteins. Both single- and multi-site
mutations were considered. Potapov-DB removed redundant data by averaging the free energy change
(ΔΔG) of a mutant when multiple data were available.

All the above data sets were compiled by different people under different constraints and
conditions. Each data set may be non-redundant in itself; however, they were all drawn from the same
ProTherm database and could share homologues with each other. To give a fair and controllable
independent assessment of our model, we built a new data set, TEST_May11, from the updated ProTherm
database (entries from September 2005 to May 2011) using Capriotti's [19] search constraints: only
single point mutations; reversible experiments; and ΔΔG values with known experimental conditions
(temperature and pH). The training set DBSEQ_Sep05 had been built earlier by Capriotti, in September
2005. TEST_May11 contains 1004 mutations in 51 proteins: 375 destabilizing mutations (ΔΔG < -1.00
kcal/mol), 61 stabilizing mutations (ΔΔG > +1.00 kcal/mol), and 568 neutral mutations
(-1.00 ≤ ΔΔG ≤ +1.00 kcal/mol).

Data sets clean up 

To avoid including mutations that share similarity with those of the DBSEQ_Sep05 training set, the
independent data sets TEST_May11, S388, Potapov, PoPMuSiC, and S1615 were searched with BLAST
against the DBSEQ_Sep05 seq58-protein database. A mutation sample was deleted from a test set when
its mutation site fell within the q.start to q.end sequence region of a BLAST hit with sequence
similarity (identity > 25%) (Additional file 4: blast.independentl75.against.seq58). For example,
934 redundant mutations were deleted from the PoPMuSiC data set (2648 mutation samples in 134
proteins), and the resulting data set clean.PoPMuSic retained only 1712 mutations in 109 protein
chains. After removing all these sequence-similar mutation samples, we obtained the "clean" test
sets (Additional file 2): clean.TEST_May11, clean.S388, clean.Potapov, clean.PoPMuSic, and
clean.S1615. The test files can be found in the supplementary materials of the paper. The statistics
and an explanation of the clean test sets are given in Additional file 3: Tables S1 and S2. The
clean data sets were used to evaluate our prediction model.

Balancing mutation samples 

Experimental data in the ProTherm database are intrinsically asymmetric and unbalanced, with
destabilizing mutations outnumbering stabilizing ones. Unbalanced training samples result in poor
accuracy on the minority (positive) samples in machine learning methods such as SVM. This is because
the class boundary learned by the SVM is skewed towards the majority (negative) class, so that many
positive examples may be classified as negative (false negatives). From the point of view of basic
thermodynamics, a protein and its mutated form should be endowed with the same free energy change.
The asymmetric abundance of the three classes can therefore be corrected by adding, for each
mutation, the reverse mutation (namely the mutation that transforms the mutant back into the
original protein), taking the value of the experimental measurement with the opposite sign (-ΔΔG).
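
As an illustration of this balancing step, the sketch below adds a reverse mutation with the opposite ΔΔG sign for every non-neutral sample. It is a simplified reading of the procedure, not the authors' implementation: the tuple layout `(wild_type, mutant, ddg)` and the function names are hypothetical, and the ±1.0 kcal/mole cutoff is taken from the classification criteria stated in the text.

```python
from collections import Counter


def three_state(ddg, threshold=1.0):
    """Class name for a ddG value (boundary values treated as neutral)."""
    if ddg < -threshold:
        return "destabilizing"
    if ddg > threshold:
        return "stabilizing"
    return "neutral"


def add_reverse_mutations(samples, threshold=1.0):
    """Append the reverse mutation for every non-neutral sample.

    samples: list of (wild_type, mutant, ddg) tuples (hypothetical layout).
    The reverse mutation swaps the two residues and negates ddG.
    """
    augmented = list(samples)
    for wild_type, mutant, ddg in samples:
        if abs(ddg) > threshold:
            augmented.append((mutant, wild_type, -ddg))
    return augmented


def class_counts(samples):
    """Count samples per class after classification by ddG."""
    return Counter(three_state(ddg) for _, _, ddg in samples)
```

Applied to a set with the class sizes reported for DBSEQ_Sep05 (138 stabilizing, 663 destabilizing, 822 neutral), this yields 801 stabilizing, 801 destabilizing, and 822 neutral samples, matching the counts given in the data description.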

20-fold Cross validation test 

The data set DBSEQ_Sep05 was used in our experiments for the cross validation tests of the different
encoding schemes. To keep similar sequences in the same partition, the DBSEQ_Sep05 sequences were
searched with BLAST against the DBSEQ_Sep05 sequence database itself. The results are given in
Additional file 5: blast.DBSEQ_Sep05. With similarity > 25%, the mutation samples were clustered
into 58 groups. Groups were randomly selected and joined to make a test set, and the corresponding
training set was formed from the DBSEQ_Sep05 entries that were not in the test set. The groups, test
sets, and complementary training sets are described in Additional file 6: blast.group.DBSEQ_Sep05
and Additional file 7: TrainTestSet.description. The "serials" in these descriptions correspond to
the sample entries in Additional file 1: DBSEQ_Sep05.txt. The "group" is the GROUP number defined in
Additional file 6: blast.group.DBSEQ_Sep05. The test/training sets were then balanced by reversing
the sign of ΔΔG for samples with ΔΔG < -1.0 or ΔΔG > 1.0, and the encoding schemes were applied to
each test/training set afterwards. Each round of the cross validation test consisted of twenty
iterations of the training/test procedure. Twenty rounds of cross validation were carried out for
each encoding scheme, and the test accuracies were averaged for the scheme.
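
The group-wise partitioning can be sketched as follows. This is an illustrative outline only: the mapping from sample identifier to similarity group is assumed to be available already (for example from the BLAST clustering), and the number of groups drawn into each test set, `n_test_groups`, is a hypothetical parameter that the text does not specify.

```python
import random


def grouped_split(sample_groups, n_test_groups, seed=0):
    """Split samples so that no similarity group spans training and test.

    sample_groups: dict mapping a sample id to its group id.
    n_test_groups: how many whole groups go into the test set (assumed).
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)
    groups = sorted(set(sample_groups.values()))
    test_groups = set(rng.sample(groups, n_test_groups))
    test_ids = [s for s, g in sample_groups.items() if g in test_groups]
    train_ids = [s for s, g in sample_groups.items() if g not in test_groups]
    return train_ids, test_ids
```

Because whole groups are moved into the test set, sequences with more than 25% similarity never appear on both sides of a split.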

The predictors 

The LibSVM package 2.82 [23] was used for SVM training and prediction. The radial basis function
(RBF kernel, $K(x_i, x_j) = \exp(-g \lVert x_i - x_j \rVert^2)$) was used as the kernel function in
the experiment. The cost parameter C and the kernel parameter g were optimized with the grid tool
built into the package, which took several hours for each training subset. The optimized C and g
values were determined from the grid results and differed from subset to subset, depending on the
data distribution of the specific random partition. In our lab records, C values varied from 2 to
32768 and g values from 0.0078125 to 2.0; in principle they could range even further. The optimized
C and g were used to train LibSVM on the training subset, and the resulting model was used to
predict protein stabilities for the corresponding test subset. A given single point protein mutation
was classified into one of three classes: stabilizing, destabilizing, or neutral. The classes were
represented by three labels: "0" for stabilizing mutations (ΔΔG > 1.0 kcal/mole), "1" for
destabilizing mutations (ΔΔG < -1.0 kcal/mole), and "2" for neutral mutations
(-1.0 ≤ ΔΔG ≤ 1.0 kcal/mole).
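
To make the kernel and the parameter search concrete, the sketch below evaluates the RBF kernel in plain Python and enumerates a grid of (C, g) candidates. It does not call LibSVM. The reported extremes are all powers of two (2 = 2^1, 32768 = 2^15, 0.0078125 = 2^-7), so the grid is built from exponents; the exponent step of 2 is an assumption, not a value taken from the text.

```python
import math


def rbf_kernel(x, y, g):
    """RBF kernel K(x, y) = exp(-g * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * squared_distance)


def power_of_two_grid(c_exponents, g_exponents):
    """All (C, g) pairs with C = 2**c and g = 2**e for the given exponents."""
    return [(2.0 ** c, 2.0 ** e) for c in c_exponents for e in g_exponents]


# Exponent ranges chosen to span the values reported in the text:
# C from 2 (2^1) to 32768 (2^15), g from 0.0078125 (2^-7) to 2.0 (2^1).
# The step of 2 between exponents is an assumption.
GRID = power_of_two_grid(range(1, 16, 2), range(-7, 2, 2))
```

Each pair in `GRID` would be scored by cross validation on the training subset, and the best-scoring pair used to train the final model for that subset.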

Input vectors and encoding schemes 

One of the important steps in machine learning approaches is to encode the raw data into a format
that the machine can process. To encode the mutated position and its surrounding environment into
vectors, we used the deleted residue, the introduced residue, the amino acids in the environment
window around the mutated position, the experimental pH and temperature, etc.

Sparse encoding scheme

The most widely used representation of an amino acid sequence in bioinformatics modelling is the
"sparse encoding" scheme [19]. The input vector consists of 42 values. The first 2 input values are,
respectively, the temperature and the pH at which the stability of the mutated protein was
experimentally determined. The next 20 values (one for each of the 20 residue types) explicitly
define the mutation: the element corresponding to the deleted residue is set to -1 and the element
corresponding to the new residue to 1, while all remaining elements are kept at 0. The last 20 input
values encode the residue environment: each is the number of residues of the corresponding type
found inside a symmetrical window centred at the mutated residue, spanning the sequence towards the
left (N-terminus) and the right (C-terminus), for a total length of 31 residues [19].
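
The 42-value layout can be illustrated with a short function. This is a sketch under several assumptions that the text leaves open: the ordering of the 20 residue types, the use of unscaled temperature and pH, truncation of the window at the sequence ends, and counting the mutated residue itself in the environment. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # residue ordering is an assumption


def sparse_encode(sequence, position, new_residue, temperature, ph, window=31):
    """Build a 42-value vector: 2 conditions + 20 mutation + 20 environment.

    sequence: wild-type protein sequence (one-letter codes).
    position: 0-based index of the mutated residue.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    # Mutation block: -1 for the deleted residue, +1 for the new residue.
    mutation = [0] * 20
    mutation[index[sequence[position]]] = -1
    mutation[index[new_residue]] = 1

    # Environment block: residue counts in a window centred on the mutation.
    # Assumption: the window is truncated at the ends of the sequence and
    # includes the mutated position itself.
    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    environment = [0] * 20
    for aa in sequence[start:end]:
        if aa in index:
            environment[index[aa]] += 1

    return [temperature, ph] + mutation + environment
```

For a mutation of the residue at index 2 of a 20-residue sequence, the window covers only 18 residues because it is cut off at the N-terminus.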

11-factor encoding scheme

The sparse encoding scheme represents amino acids with different numbers, and the numbers themselves
bear no relation to the physicochemical properties of the amino acids. Leucine, for instance, has a
polarity similar to that of isoleucine but quite different from that of glutamic acid. However, Leu,
Ile, and Glu have the same status in the sparse encoding scheme, each simply taking a different
number. The sparse encoding scheme thus does not account for any similarity in physicochemical
properties between amino acids. Liu W. et al. successfully used amino acid property encoding schemes
with support vector machines [24]. They extracted 17 amino acid physicochemical parameters from
AAindex, eliminated highly correlated properties (correlation coefficient r > 0.8), and obtained
good performance with 11 factors, which were linearly scaled from the raw data to the range [0,1].
We used their 11-factor encoding scheme in our experiment.

HEC encoding scheme

Chou-Fasman's amino acid propensity parameters for protein secondary structure conformation, namely
helix propensity (He), sheet propensity (Ee), and coil propensity (Ce) [25], were recalculated with
the modern non-redundant protein secondary structure data sets CB513 [26] and RS126 [27]. To test
the amino acid conformation propensity properties in our experiment, the propensity parameters were
transformed into the range [0,1] with the formula $1/(1 + e^{-x})$.

K-D encoding scheme

Amino acid hydrophobicity is believed to be one of the most important properties for maintaining the
protein tertiary structure. Kyte and Doolittle's hydrophobicity scale [28] was used to test the
effect of a single amino acid property on prediction. We transformed the Kyte and Doolittle data
into the range [0,1] with the formula $1/(1 + e^{-x})$ and named the result the K-D encoding scheme.
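
The transformation can be reproduced in a few lines. The sketch below applies the logistic formula to the published Kyte-Doolittle hydropathy values; that the formula was applied directly to the raw scale values, without any prior rescaling, is an assumption, and the variable names are illustrative.

```python
import math

# Kyte-Doolittle hydropathy index, keyed by one-letter residue code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def logistic(x):
    """Map a real number into the open interval (0, 1) with 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumption: the formula is applied to the raw scale values.
KD_SCALED = {aa: logistic(value) for aa, value in KYTE_DOOLITTLE.items()}
```

The most hydrophobic residue on this scale (Ile) maps close to 1 and the least hydrophobic (Arg) close to 0, while the ordering of the residues is preserved.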



Property encoding scheme (AAproperty15)

The 11-factor amino acid properties were combined with the amino acid secondary structure
conformation propensity parameters He, Ee, and Ce. To emphasize the influence of hydrophobicity on
protein structure, the Kyte-Doolittle hydrophobicity scale was also added to the encoding scheme. A
list of 15 factors was obtained (Table 3). We named the encoding scheme "AAproperty15".

Graded property encoding scheme (AAproperty15Grade)

Compared with sparse encoding, the problem introduced by the property encoding scheme may be
complexity. For each property, the amino acids take 20 different values, so 15 properties and a
window length of 31 give 9300 values (20 × 15 × 31). Together with the encoding of the deleted
residue, the new residue, temperature, and pH, the property encoding scheme therefore introduces
complexity alongside its numerous benefits and advantages.

For a specific physicochemical property, the amino acids can usually be grouped into strong, middle,
or weak classes. For hydrophobicity, for example, there are strongly, moderately, and weakly
hydrophobic amino acids. The numeric amino acid values of each property can be partitioned into
three groups by defining two numeric thresholds.

Rather than directly using the numeric amino acid property values in the encoding scheme, we define
three distinct numbers to represent the strong, middle, and weak classes. When the numeric value is
less than the lower limit, the amino acid is represented as 0.1; when it is greater than the upper
limit, the amino acid is represented as 0.9; when it is equal to or greater than the lower limit and
equal to or less than the upper limit, the amino acid is represented as 0.5, as shown in Equation 1.
The lower and upper limits are arbitrary numbers chosen to partition the 20 amino acids as evenly as
possible into three groups according to the distribution of the property values.



$$
S_i^a =
\begin{cases}
0.1 & \text{if } P_i^a < L_i \\
0.5 & \text{if } L_i \le P_i^a \le U_i \\
0.9 & \text{if } P_i^a > U_i
\end{cases}
\qquad (1)
$$

where $S_i^a$ is the score used in the coding scheme, $P_i^a$ is the numeric value of property i of
amino acid a, $L_i$ is the lower limit of property i, and $U_i$ is the upper limit of property i.

Table 3 The amino acid property scores used in the AAproperty15 encoding scheme

For each property, two thresholds partition the twenty amino acids into three classes: weak, middle,
or strong. Each amino acid takes one of three, rather than one of twenty, different numbers for each
property, which greatly reduces the complexity and noise. Table 3 shows the fifteen amino acid
property encoding values, and Table 4 shows the scores used in the graded encoding scheme, which
were derived from Table 3 with two thresholds per property. The thresholds were arbitrary and were
chosen to give as equal a number of amino acids in each group as possible. The thresholds (lower
limit/upper limit) are: steric parameter 0.65/0.7; hydrogen bond donors 0.35/0.66; hydrophobicity
scale 0.25/0.65; hydrophilicity scale 0.18/0.55; average accessible surface area 0.2/0.5; van der
Waals parameter R0 0.3/0.7; van der Waals parameter Epsilon 0.1/0.5; free energy of solution in
water 0.4/0.55; average side chain orientation angle 0.3/0.7; polarity 0.02/0.5; isoelectric point
0.3/0.401; He 0.72/0.76; Ee 0.67/0.8; Ce 0.7/0.75; and KDe 0.1/0.8.
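
The grading rule and the listed thresholds translate directly into a lookup. The following is a minimal sketch: the threshold pairs are copied from the list above, whereas the dictionary keys, the function names, and the example scores in the usage note are illustrative and not taken from Table 3.

```python
# (lower limit, upper limit) for each property, as listed in the text.
# The key strings are illustrative labels.
THRESHOLDS = {
    "steric parameter": (0.65, 0.7),
    "hydrogen bond donors": (0.35, 0.66),
    "hydrophobicity scale": (0.25, 0.65),
    "hydrophilicity scale": (0.18, 0.55),
    "average accessible surface area": (0.2, 0.5),
    "van der Waals parameter R0": (0.3, 0.7),
    "van der Waals parameter Epsilon": (0.1, 0.5),
    "free energy of solution in water": (0.4, 0.55),
    "average side chain orientation angle": (0.3, 0.7),
    "polarity": (0.02, 0.5),
    "isoelectric point": (0.3, 0.401),
    "He": (0.72, 0.76),
    "Ee": (0.67, 0.8),
    "Ce": (0.7, 0.75),
    "KDe": (0.1, 0.8),
}


def grade(value, prop):
    """Grade one scaled property value as 0.1 (weak), 0.5 (middle) or 0.9 (strong)."""
    lower, upper = THRESHOLDS[prop]
    if value < lower:
        return 0.1
    if value > upper:
        return 0.9
    return 0.5  # lower <= value <= upper


def grade_residue(scores):
    """Grade every property of one amino acid; scores maps property -> value."""
    return {prop: grade(value, prop) for prop, value in scores.items()}
```

With hypothetical scores `{"polarity": 0.6, "He": 0.7}`, `grade_residue` returns `{"polarity": 0.9, "He": 0.1}`, since 0.6 exceeds the upper polarity limit and 0.7 falls below the lower He limit.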

Scoring the performance 

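Seven indices were calculated to assess the prediction system: the total accuracy (sensitivity) Q3
(Equation 2), the total Matthews correlation coefficient (MCC) (Equation 3) [29], and, for each
class, the accuracy (sensitivity) Q (Equation 4), the specificity (Equation 5), the positive
predictive value (PPV) (Equation 6), the negative predictive value (NPV) (Equation 7), and the MCC
(Equation 8).

$$Q_3 = \frac{\sum_{i=1}^{k} p(i)}{N} \qquad (2)$$

$$\mathrm{MCC}_{total} = \frac{\sum_{i=1}^{k} \left(p(i) + u(i)\right) \mathrm{MCC}(i)}{N} \qquad (3)$$

$$Q(i) = \frac{p(i)}{p(i) + u(i)} \qquad (4)$$

$$\mathrm{Specificity}(i) = \frac{n(i)}{n(i) + o(i)} \qquad (5)$$

$$\mathrm{PPV}(i) = \frac{p(i)}{p(i) + o(i)} \qquad (6)$$

$$\mathrm{NPV}(i) = \frac{n(i)}{n(i) + u(i)} \qquad (7)$$

$$\mathrm{MCC}(i) = \frac{p(i)\,n(i) - u(i)\,o(i)}{\sqrt{(p(i)+u(i))\,(p(i)+o(i))\,(n(i)+u(i))\,(n(i)+o(i))}} \qquad (8)$$

Here, i is any subfamily (class), N is the total number of sequences, k is the number of
subfamilies, p(i) is the number of correctly predicted sequences of subfamily i, n(i) is the number
of correctly predicted sequences not of subfamily i, u(i) is the number of under-predicted
sequences, and o(i) is the number of over-predicted sequences; in other words, p(i) = TP,
n(i) = TN, u(i) = FN, and o(i) = FP.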
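
These indices can be computed from lists of true and predicted labels. The sketch below uses the standard TP/TN/FN/FP definitions with the p, n, u, o notation of the text and weights the per-class MCC by class size for the total MCC, as in Equation 3. The function names are hypothetical, and returning 0 for a ratio with a zero denominator is an assumption made only to keep the example total.

```python
import math


def class_counts(true_labels, predicted_labels, cls):
    """Return (p, n, u, o) = (TP, TN, FN, FP) for one class against the rest."""
    p = n = u = o = 0
    for t, y in zip(true_labels, predicted_labels):
        if t == cls and y == cls:
            p += 1
        elif t != cls and y != cls:
            n += 1
        elif t == cls:
            u += 1  # under-predicted: true member assigned elsewhere
        else:
            o += 1  # over-predicted: non-member assigned to this class
    return p, n, u, o


def class_metrics(p, n, u, o):
    """Per-class Q, specificity, PPV, NPV and MCC."""
    def ratio(a, b):
        return a / b if b else 0.0  # assumption: undefined ratios become 0

    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return {
        "Q": ratio(p, p + u),
        "specificity": ratio(n, n + o),
        "PPV": ratio(p, p + o),
        "NPV": ratio(n, n + u),
        "MCC": ratio(p * n - u * o, denom),
    }


def overall_metrics(true_labels, predicted_labels):
    """Return (Q3, MCC_total), the latter weighted by class size."""
    total = len(true_labels)
    q3 = 0.0
    mcc_total = 0.0
    for cls in sorted(set(true_labels)):
        p, n, u, o = class_counts(true_labels, predicted_labels, cls)
        q3 += p / total
        mcc_total += (p + u) * class_metrics(p, n, u, o)["MCC"] / total
    return q3, mcc_total
```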



Multi-class ROCR 

Table 4 The graded amino acid property encoding scheme AAproperty15Grade

Currently, ROCR supports only binary classification [30,31]; if there are more than two distinct
label symbols, execution stops with an error message. To overcome this limit of the ROCR package, we
defined the functions split.class and split.probabilities to split the classes and the probabilities
independently and turn the data into one-against-rest form. We collected the three class
probabilities with the ROCR built-in function predict (probability = TRUE). With the list function,
we then joined the independent data of the three classes and plotted the ROC curve, which can then
represent the three-class classification. The user-defined functions can be found in Additional
file 8: multi.class.functions.rocr.
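
The one-against-rest idea is independent of ROCR and can be sketched in plain Python. The code below is not a port of the authors' split.class and split.probabilities functions in R; it is an illustrative equivalent with hypothetical names, which assumes that each sample comes with one probability per class and that every class has at least one positive and one negative sample.

```python
def one_vs_rest(labels, probabilities, cls):
    """Binary labels and scores for class `cls` against all other classes.

    probabilities: one sequence of class probabilities per sample,
    indexed by class label.
    """
    binary = [1 if y == cls else 0 for y in labels]
    scores = [p[cls] for p in probabilities]
    return binary, scores


def roc_points(binary, scores):
    """(FPR, TPR) points obtained by lowering the threshold over the scores."""
    positives = sum(binary)
    negatives = len(binary) - positives
    ranked = sorted(zip(scores, binary), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores are added together, giving one ROC point.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Calling `one_vs_rest` once per class gives three binary problems, whose ROC curves can then be drawn together or averaged.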

Additional material